Wide Crawl started March 2011

Web-wide crawl with the initial seed list and crawler configuration from March 2011. It uses the new HQ software by Kenji Nagahashi for distributed crawling.

What’s in the data set:

Crawl start date: 09 March, 2011
Crawl end date: 23 December, 2011
Number of captures: 2,713,676,341
Number of unique URLs: 2,273,840,159
Number of hosts: 29,032,069
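
As a rough aid to reading these figures, the following minimal sketch derives a few averages from them. The variable names are illustrative and not part of the data set, and the averages are only indicative, since they say nothing about how unevenly captures were spread across URLs, hosts, or days.

```python
from datetime import date

# Figures copied from the data set summary above.
crawl_start = date(2011, 3, 9)
crawl_end = date(2011, 12, 23)
captures = 2_713_676_341
unique_urls = 2_273_840_159
hosts = 29_032_069

# Simple derived averages (illustrative only).
duration_days = (crawl_end - crawl_start).days
captures_per_url = captures / unique_urls
urls_per_host = unique_urls / hosts
captures_per_day = captures / duration_days

print(f"Crawl duration: {duration_days} days")
print(f"Mean captures per unique URL: {captures_per_url:.2f}")
print(f"Mean unique URLs per host: {urls_per_host:.1f}")
print(f"Mean captures per day: {captures_per_day:,.0f}")
```

A mean of more than one capture per unique URL indicates that some URLs were captured more than once during the crawl.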

The seed list for this crawl was a list of Alexa’s top 1 million web sites, retrieved close to the crawl start date. We used the Heritrix (3.1.1-SNAPSHOT) crawler software and respected robots.txt directives. The scope of the crawl was not limited, except for a few manually excluded sites.

However, this was a somewhat experimental crawl for us: we were using newly written software to feed URLs to the crawlers, and we know it had some operational issues. For example, in many cases we may not have crawled all of the embedded and linked objects in a page, because the URLs for these resources were added to queues that quickly grew beyond the intended size of the crawl, so we never got to them. We also included repeated crawls of some Argentinian government sites, so results broken down by country will be somewhat skewed.

We have made many changes to how we do these wide crawls since this particular example, but we wanted to make the data available “warts and all” for people to experiment with. We have also done some further analysis of the content.

If you would like access to this set of crawl data, please contact us at info at archive dot org and let us know who you are and what you’re hoping to do with it. We may not be able to say “yes” to all requests, since we’re just figuring out whether this is a good idea, but everyone will be considered.

8,528 results


Part of: Worldwide Web Crawls, Internet Archive Web Crawls
Wide Crawl started March 2011
web
Views: 418,657
Internet Archive crawldata from Webwide Crawl, captured by crawl415.us.archive.org:wide from Sun Aug 14 03:13:23 PDT 2011 to Sat Aug 13 22:20:18 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 221,756
Internet Archive crawldata from Webwide Crawl, captured by crawl416.us.archive.org:wide from Thu Jun 23 21:11:43 PDT 2011 to Thu Jun 23 14:53:15 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 216,384
Internet Archive crawldata from Webwide Crawl, captured by crawl448.us.archive.org:argov from Wed Aug 24 01:32:00 PDT 2011 to Thu Sep 1 12:02:59 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 215,243
Internet Archive crawldata from Webwide Crawl, captured by crawl410.us.archive.org:wide from Fri May 13 21:19:40 PDT 2011 to Fri May 13 22:10:12 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 181,792
Internet Archive crawldata from Webwide Crawl, captured by crawl423.us.archive.org:wide from Wed Apr 27 01:17:43 PDT 2011 to Tue Apr 26 20:39:17 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 150,630
Internet Archive crawldata from Webwide Crawl, captured by crawl426.us.archive.org:wide from Sun May 1 01:24:09 PDT 2011 to Sat Apr 30 22:56:48 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 148,416
Internet Archive crawldata from Webwide Crawl, captured by crawl426.us.archive.org:wide from Wed May 11 12:22:46 PDT 2011 to Wed May 11 08:39:50 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 148,056
Internet Archive crawldata from Webwide Crawl, captured by crawl420.us.archive.org:wide from Tue Apr 26 17:43:13 PDT 2011 to Tue Apr 26 12:42:46 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 142,157
Internet Archive crawldata from Friendster Blogs Crawl, captured by crawl439.us.archive.org:friendster from Thu Jun 23 01:41:07 PDT 2011 to Wed Jun 22 19:53:32 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 142,133
Internet Archive crawldata from Webwide Crawl, captured by crawl421.us.archive.org:wide from Thu Oct 6 05:28:56 PDT 2011 to Thu Oct 6 00:58:42 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 139,762
Internet Archive crawldata from Webwide Crawl, captured by crawl421.us.archive.org:wide from Sat Jul 16 07:24:18 PDT 2011 to Sat Jul 16 00:57:31 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 129,987
Internet Archive crawldata from Webwide Crawl, captured by crawl417.us.archive.org:wide from Sat Apr 30 16:45:54 PDT 2011 to Sat Apr 30 21:10:35 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 126,504
Internet Archive crawldata from Friendster Blogs Crawl, captured by crawl439.us.archive.org:friendster from Fri Jun 24 05:16:06 PDT 2011 to Thu Jun 23 22:45:10 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 124,399
Internet Archive crawldata from Webwide Crawl, captured by crawl411.us.archive.org:wide from Mon Jun 6 22:08:59 PDT 2011 to Mon Jun 6 16:02:13 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 120,563
Internet Archive crawldata from Webwide Crawl, captured by crawl416.us.archive.org:wide from Thu Jun 23 20:04:13 PDT 2011 to Thu Jun 23 14:00:57 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 114,149
Internet Archive crawldata from Webwide Crawl, captured by crawl410.us.archive.org:wide from Thu Apr 21 18:29:23 PDT 2011 to Thu Apr 21 17:57:54 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 113,855
Internet Archive crawldata from Webwide Crawl, captured by crawl425.us.archive.org:wide from Fri Jul 22 05:35:21 PDT 2011 to Thu Jul 21 23:36:33 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 113,102
Internet Archive crawldata from Friendster Blogs Crawl, captured by crawl439.us.archive.org:friendster from Thu Jun 23 03:05:40 PDT 2011 to Wed Jun 22 20:25:57 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 112,342
Internet Archive crawldata from Webwide Crawl, captured by crawl422.us.archive.org:wide from Fri Apr 29 00:28:27 PDT 2011 to Thu Apr 28 20:15:25 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 110,184
Internet Archive crawldata from Webwide Crawl, captured by crawl448.us.archive.org:argov from Wed Aug 24 22:48:30 PDT 2011 to Thu Sep 1 12:13:09 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 107,554
Internet Archive crawldata from Webwide Crawl, captured by crawl416.us.archive.org:wide from Sun Jul 3 19:45:18 PDT 2011 to Sun Jul 3 13:40:14 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 107,413
Internet Archive crawldata from Webwide Crawl, captured by crawl429.us.archive.org:wide from Sat Apr 30 12:31:45 PDT 2011 to Sat Apr 30 11:48:18 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 107,411
Internet Archive crawldata from Webwide Crawl, captured by crawl426.us.archive.org:wide from Sun May 1 05:56:48 PDT 2011 to Sun May 1 03:35:51 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 106,607
Internet Archive crawldata from Webwide Crawl, captured by crawl415.us.archive.org:wide from Mon Aug 8 22:37:24 PDT 2011 to Mon Aug 8 19:58:57 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 103,604
Internet Archive crawldata from Webwide Crawl, captured by crawl421.us.archive.org:wide from Tue Apr 26 01:41:22 PDT 2011 to Mon Apr 25 20:29:54 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 97,414
Internet Archive crawldata from Webwide Crawl, captured by crawl426.us.archive.org:wide from Sat Apr 30 08:13:45 PDT 2011 to Sat Apr 30 06:46:45 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 96,696
Internet Archive crawldata from Webwide Crawl, captured by crawl338.us.archive.org:wide from Wed Mar 9 00:51:25 PST 2011 to Tue Mar 8 19:52:22 PST 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 95,064
Internet Archive crawldata from Friendster Blogs Crawl, captured by crawl439.us.archive.org:friendster from Fri Jun 24 07:39:04 PDT 2011 to Fri Jun 24 01:20:52 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 94,317
Internet Archive crawldata from Webwide Crawl, captured by crawl427.us.archive.org:wide from Sat Aug 27 21:31:42 PDT 2011 to Sat Aug 27 22:07:22 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 94,243
Internet Archive crawldata from Webwide Crawl, captured by crawl418.us.archive.org:wide from Thu May 12 19:16:19 PDT 2011 to Thu May 12 17:08:02 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 91,801
Internet Archive crawldata from Webwide Crawl, captured by crawl411.us.archive.org:wide from Tue Jun 14 06:14:59 PDT 2011 to Tue Jun 14 00:23:44 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 91,293
Internet Archive crawldata from Webwide Crawl, captured by crawl425.us.archive.org:wide from Thu Jul 21 18:13:35 PDT 2011 to Thu Jul 21 12:04:47 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 89,981
Internet Archive crawldata from Webwide Crawl, captured by crawl413.us.archive.org:wide from Sun Jul 3 05:22:17 PDT 2011 to Sat Jul 2 23:10:34 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 89,179
Internet Archive crawldata from Webwide Crawl, captured by crawl424.us.archive.org:wide from Fri Apr 29 18:33:24 PDT 2011 to Fri Apr 29 14:42:40 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 88,359
Internet Archive crawldata from Webwide Crawl, captured by crawl425.us.archive.org:wide from Fri Apr 29 18:34:30 PDT 2011 to Fri Apr 29 14:28:25 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 88,187
Internet Archive crawldata from Webwide Crawl, captured by crawl423.us.archive.org:wide from Sat Jul 16 06:47:12 PDT 2011 to Sat Jul 16 00:46:51 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 85,749
Internet Archive crawldata from Webwide Crawl, captured by crawl413.us.archive.org:wide from Thu Apr 21 18:53:32 PDT 2011 to Fri Apr 22 04:21:25 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 82,729
Internet Archive crawldata from Webwide Crawl, captured by crawl416.us.archive.org:wide from Wed Jul 20 11:51:54 PDT 2011 to Wed Jul 20 07:21:56 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 80,378
Internet Archive crawldata from Friendster Blogs Crawl, captured by crawl439.us.archive.org:friendster from Fri Jun 24 04:41:37 PDT 2011 to Thu Jun 23 22:14:12 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 78,889
Internet Archive crawldata from Webwide Crawl, captured by crawl419.us.archive.org:wide from Thu May 12 20:32:45 PDT 2011 to Thu May 12 16:18:38 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 78,244
Internet Archive crawldata from Webwide Crawl, captured by crawl425.us.archive.org:wide from Wed Jul 27 22:51:50 PDT 2011 to Wed Jul 27 16:59:32 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 77,597
Internet Archive crawldata from Friendster Blogs Crawl, captured by crawl439.us.archive.org:friendster from Thu Jun 23 16:46:29 PDT 2011 to Thu Jun 23 10:34:07 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 76,985
Internet Archive crawldata from Webwide Crawl, captured by crawl411.us.archive.org:wide from Tue Apr 19 23:14:47 PDT 2011 to Thu Apr 21 11:32:30 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 76,967
Internet Archive crawldata from Webwide Crawl, captured by crawl414.us.archive.org:wide from Sat Apr 23 00:19:37 PDT 2011 to Sat Apr 23 04:18:58 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 76,846
Internet Archive crawldata from Webwide Crawl, captured by crawl412.us.archive.org:wide from Tue Sep 27 13:48:57 PDT 2011 to Tue Sep 27 07:46:21 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 76,834
Internet Archive crawldata from Webwide Crawl, captured by crawl425.us.archive.org:wide from Tue Jul 26 18:06:41 PDT 2011 to Tue Jul 26 12:55:43 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 76,816
Internet Archive crawldata from Webwide Crawl, captured by crawl413.us.archive.org:wide from Wed Jul 13 12:01:00 UTC 2011 to Wed Jul 13 12:57:54 UTC 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 76,622
Internet Archive crawldata from Webwide Crawl, captured by crawl420.us.archive.org:wide from Wed Apr 27 07:58:43 PDT 2011 to Tue Jun 14 13:40:31 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 76,596
Internet Archive crawldata from Webwide Crawl, captured by crawl416.us.archive.org:wide from Fri Jun 24 01:53:49 PDT 2011 to Thu Jun 23 19:38:20 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 75,287
Internet Archive crawldata from Webwide Crawl, captured by crawl413.us.archive.org:wide from Sat Sep 24 16:30:43 PDT 2011 to Sat Sep 24 10:21:46 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 74,878
Internet Archive crawldata from Webwide Crawl, captured by crawl411.us.archive.org:wide from Sat Aug 27 10:49:07 PDT 2011 to Sat Aug 27 05:40:30 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 74,555
Internet Archive crawldata from Webwide Crawl, captured by crawl426.us.archive.org:wide from Sat Apr 30 18:51:58 PDT 2011 to Sat Apr 30 17:48:25 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 73,968
Internet Archive crawldata from Webwide Crawl, captured by crawl429.us.archive.org:wide from Sat Aug 27 10:03:44 PDT 2011 to Sat Aug 27 04:21:29 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 73,261
Internet Archive crawldata from Webwide Crawl, captured by crawl425.us.archive.org:wide from Wed Jun 15 01:13:49 PDT 2011 to Tue Jun 14 20:47:44 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 73,059
Internet Archive crawldata from Webwide Crawl, captured by crawl421.us.archive.org:wide from Thu Oct 6 02:51:02 PDT 2011 to Wed Oct 5 21:45:35 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 72,957
Internet Archive crawldata from Webwide Crawl, captured by crawl415.us.archive.org:wide from Wed Apr 20 01:03:39 PDT 2011 to Thu Apr 21 22:48:08 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 72,341
Internet Archive crawldata from Webwide Crawl, captured by crawl426.us.archive.org:wide from Tue Jul 26 15:28:09 PDT 2011 to Tue Jul 26 09:53:32 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 72,323
Internet Archive crawldata from Webwide Crawl, captured by crawl412.us.archive.org:wide from Sat May 21 07:56:36 PDT 2011 to Sat May 21 01:56:05 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 72,266
Internet Archive crawldata from Webwide Crawl, captured by crawl418.us.archive.org:wide from Fri May 13 00:44:55 PDT 2011 to Fri May 13 00:41:56 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 71,005
Internet Archive crawldata from Webwide Crawl, captured by crawl429.us.archive.org:wide from Sat Aug 27 23:14:07 PDT 2011 to Sat Aug 27 21:39:19 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 70,861
Internet Archive crawldata from Webwide Crawl, captured by crawl423.us.archive.org:wide from Wed Jun 15 03:06:30 PDT 2011 to Tue Jun 14 21:53:14 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 69,460
Internet Archive crawldata from Webwide Crawl, captured by crawl423.us.archive.org:wide from Mon Jul 18 02:15:05 PDT 2011 to Sun Jul 17 22:05:24 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 68,884
Internet Archive crawldata from Webwide Crawl, captured by crawl415.us.archive.org:wide from Mon May 16 04:04:08 PDT 2011 to Mon May 16 01:06:29 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 68,710
Internet Archive crawldata from Webwide Crawl, captured by crawl412.us.archive.org:wide from Wed Jul 20 01:09:57 PDT 2011 to Tue Jul 19 19:06:52 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 68,649
Internet Archive crawldata from Webwide Crawl, captured by crawl417.us.archive.org:wide from Wed Jul 6 13:34:31 PDT 2011 to Wed Jul 6 07:42:44 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 68,596
Internet Archive crawldata from Webwide Crawl, captured by crawl429.us.archive.org:wide from Fri Aug 26 07:13:39 PDT 2011 to Fri Aug 26 04:59:12 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 68,318
Internet Archive crawldata from Webwide Crawl, captured by crawl422.us.archive.org:wide from Fri Apr 29 03:28:27 PDT 2011 to Thu Apr 28 21:48:35 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 68,117
Internet Archive crawldata from Webwide Crawl, captured by crawl428.us.archive.org:wide from Wed May 11 05:57:13 PDT 2011 to Wed May 11 02:09:02 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 67,585
Internet Archive crawldata from Webwide Crawl, captured by crawl421.us.archive.org:wide from Fri Apr 29 00:28:01 PDT 2011 to Fri Apr 29 01:47:45 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 67,505
Internet Archive crawldata from Webwide Crawl, captured by crawl413.us.archive.org:wide from Sun Apr 10 23:11:45 PDT 2011 to Tue May 3 15:40:53 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 67,305
Internet Archive crawldata from Webwide Crawl, captured by crawl423.us.archive.org:wide from Sat Jul 16 09:12:32 PDT 2011 to Sat Jul 16 03:34:11 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 66,896
Internet Archive crawldata from Webwide Crawl, captured by crawl424.us.archive.org:wide from Tue Sep 20 11:02:23 PDT 2011 to Tue Sep 20 06:46:53 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 66,864
Internet Archive crawldata from Webwide Crawl, captured by crawl418.us.archive.org:wide from Fri May 13 15:51:53 PDT 2011 to Fri May 13 19:08:43 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 66,779
Internet Archive crawldata from Webwide Crawl, captured by crawl416.us.archive.org:wide from Sat Apr 30 16:46:07 PDT 2011 to Sat Apr 30 15:17:39 PDT 2011.
Topic: crawldata
Wide Crawl started March 2011
web
Views: 66,510
Internet Archive crawldata from Friendster Blogs Crawl, captured by crawl439.us.archive.org:friendster from Thu Jun 23 03:26:43 PDT 2011 to Wed Jun 22 20:46:48 PDT 2011.
Topic: crawldata
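
The item descriptions in this listing all follow one pattern: crawl name, capturing host and job, then two timestamps. The sketch below shows one way such a line might be parsed, assuming that pattern holds for every item. The names `DESCRIPTION`, `parse_stamp`, and `parse_description` are hypothetical helpers, not part of any Internet Archive tool. Time zone labels such as PDT, PST, or UTC are kept as plain strings rather than interpreted, and because in several descriptions the "from" timestamp is later than the "to" timestamp, no ordering is assumed.

```python
import re
from datetime import datetime

# Pattern inferred from the item descriptions in this listing (an assumption).
DESCRIPTION = re.compile(
    r"crawldata from (?P<crawl>.+?), captured by "
    r"(?P<host>[\w.-]+):(?P<job>\w+) "
    r"from (?P<start>.+?) to (?P<end>.+?)\.$"
)


def parse_stamp(text):
    """Split e.g. 'Sun Aug 14 03:13:23 PDT 2011' into (naive datetime, zone label)."""
    weekday, month, day, clock, zone, year = text.split()
    naive = datetime.strptime(
        f"{weekday} {month} {day} {clock} {year}", "%a %b %d %H:%M:%S %Y"
    )
    return naive, zone


def parse_description(line):
    """Return a dict of fields, or None if the line does not match the pattern."""
    match = DESCRIPTION.search(line.strip())
    if match is None:
        return None
    start, start_zone = parse_stamp(match["start"])
    end, end_zone = parse_stamp(match["end"])
    return {
        "crawl": match["crawl"],
        "host": match["host"],
        "job": match["job"],
        "start": start,
        "start_zone": start_zone,
        "end": end,
        "end_zone": end_zone,
    }


sample = (
    "Internet Archive crawldata from Friendster Blogs Crawl, captured by "
    "crawl439.us.archive.org:friendster from Thu Jun 23 01:41:07 PDT 2011 "
    "to Wed Jun 22 19:53:32 PDT 2011."
)
record = parse_description(sample)
print(record["crawl"], record["host"], record["job"])
```

Grouping parsed records by `job` (for example `wide`, `argov`, or `friendster`) would be one way to separate the main wide crawl from the other crawls that appear in the listing.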